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ABSTRACT 

We present a novel per-dimension learning rate method for 
gradient descent called ADADELTA. The method dynami- 
cally adapts over time using only first order information and 
has minimal computational overhead beyond vanilla stochas- 
tic gradient descent. The method requires no manual tuning of 
a learning rate and appears robust to noisy gradient informa- 
tion, different model architecture choices, various data modal- 
ities and selection of hyperparameters. We show promising 
results compared to other methods on the MNIST digit clas- 
sification task using a single machine and on a large scale 
voice dataset in a distributed cluster environment. 

Index Terms — Adaptive Learning Rates, Machine Learn- 
ing, Neural Networks, Gradient Descent 

1. INTRODUCTION 

The aim of many machine learning methods is to update a 
set of parameters x in order to optimize an objective function 
f(x). This often involves some iterative procedure which ap- 
plies changes to the parameters, Aa; at each iteration of the 
algorithm. Denoting the parameters at the t-th iteration as Xt, 
this simple update rule becomes: 

Ax t (1) 



Xt+l 



x t 



In this paper we consider gradient descent algorithms which 
attempt to optimize the objective function by following the 
steepest descent direction given by the negative of the gradi- 
ent g t . This general approach can be applied to update any 
parameters for which a derivative can be obtained: 

Ax t = -Tjg t (2) 

where g t is the gradient of the parameters at the t-th iteration 
9 ' and r] is a learning rate which controls how large of 
a step to take in the direction of the negative gradient. Fol- 
lowing this negative gradient for each new sample or batch 
of samples chosen from the dataset gives a local estimate 
of which direction minimizes the cost and is referred to as 
stochastic gradient descent (SGD) jTj. While often simple to 
derive the gradients for each parameter analytically, the gradi- 
ent descent algorithm requires the learning rate hyperparam- 
eter to be chosen. 

*This work was done while Matthew D. Zeiler was an intern at Google. 



Setting the learning rate typically involves a tuning pro- 
cedure in which the highest possible learning rate is chosen 
by hand. Choosing higher than this rate can cause the system 
to diverge in terms of the objective function, and choosing 
this rate too low results in slow learning. Determining a good 
learning rate becomes more of an art than science for many 
problems. 

This work attempts to alleviate the task of choosing a 
learning rate by introducing a new dynamic learning rate that 
is computed on a per-dimension basis using only first order 
information. This requires a trivial amount of extra compu- 
tation per iteration over gradient descent. Additionally, while 
there are some hyper parameters in this method, we has found 
their selection to not drastically alter the results. The benefits 
of this approach are as follows: 

• no manual setting of a learning rate. 

• insensitive to hyperparameters. 

• separate dynamic learning rate per-dimension. 

• minimal computation over gradient descent. 

• robust to large gradients, noise and architecture choice. 

• applicable in both local or distributed environments. 

2. RELATED WORK 

There are many modifications to the gradient descent algo- 
rithm. The most powerful such modification is Newton's 
method which requires second order derivatives of the cost 

function: . TT i 

Ax t = H t 1 g t (3) 

where is the inverse of the Hessian matrix of second 
derivatives computed at iteration t. This determines the op- 
timal step size to take for quadratic problems, but unfortu- 
nately is prohibitive to compute in practice for large models. 
Therefore, many additional approaches have been proposed 
to either improve the use of first order information or to ap- 
proximate the second order information. 

2.1. Learning Rate Annealing 

There have been several attempts to use heuristics for estimat- 
ing a good learning rate at each iteration of gradient descent. 
These either attempt to speed up learning when suitable or to 
slow down learning near a local minima. Here we consider 
the latter. 



When gradient descent nears a minima in the cost sur- 
face, the parameter values can oscillate back and forth around 
the minima. One method to prevent this is to slow down the 
parameter updates by decreasing the learning rate. This can 
be done manually when the validation accuracy appears to 
plateau. Alternatively, learning rate schedules have been pro- 
posed [ 1 1 to automatically anneal the learning rate based on 
how many epochs through the data have been done. These ap- 
proaches typically add additional hyperparameters to control 
how quickly the learning rate decays. 

2.2. Per-Dimension First Order Methods 

The heuristic annealing procedure discussed above modifies 
a single global learning rate that applies to all dimensions of 
the parameters. Since each dimension of the parameter vector 
can relate to the overall cost in completely different ways, 
a per-dimension learning rate that can compensate for these 
differences is often advantageous. 

2.2.1. Momentum 

One method of speeding up training per-dimension is the mo- 
mentum method J2). This is perhaps the simplest extension to 
SGD that has been successfully used for decades. The main 
idea behind momentum is to accelerate progress along dimen- 
sions in which gradient consistently point in the same direc- 
tion and to slow progress along dimensions where the sign 
of the gradient continues to change. This is done by keeping 
track of past parameter updates with an exponential decay: 

Ax t = pAxt-i - rjg t (4) 
where p is a constant controlling the decay of the previous 
parameter updates. This gives a nice intuitive improvement 
over SGD when optimizing difficult cost surfaces such as a 
long narrow valley. The gradients along the valley, despite 
being much smaller than the gradients across the valley, are 
typically in the same direction and thus the momentum term 
accumulates to speed up progress. In SGD the progress along 
the valley would be slow since the gradient magnitude is small 
and the fixed global learning rate shared by all dimensions 
cannot speed up progress. Choosing a higher learning rate 
for SGD may help but the dimension across the valley would 
then also make larger parameter updates which could lead 
to oscillations back as forth across the valley. These oscil- 
lations are mitigated when using momentum because the sign 
of the gradient changes and thus the momentum term damps 
down these updates to slow progress across the valley. Again, 
this occurs per-dimension and therefore the progress along the 
valley is unaffected. 

2.2.2. ADAGRAD 

A recent first order method called ADAGRAD 1 3 1 has shown 
remarkably good results on large scale learning tasks in a dis- 
tributed environment |4j. This method relies on only first 



order information but has some properties of second order 
methods and annealing. The update rule for ADAGRAD is 
as follows: A rj 



Ax t = - 
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(5) 
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Here the denominator computes the 11 norm of all previous 
gradients on a per-dimension basis and 77 is a global learning 
rate shared by all dimensions. 

While there is the hand tuned global learning rate, each 
dimension has its own dynamic rate. Since this dynamic rate 
grows with the inverse of the gradient magnitudes, large gra- 
dients have smaller learning rates and small gradients have 
large learning rates. This has the nice property, as in second 
order methods, that the progress along each dimension evens 
out over time. This is very beneficial for training deep neu- 
ral networks since the scale of the gradients in each layer is 
often different by several orders of magnitude, so the optimal 
learning rate should take that into account. Additionally, this 
accumulation of gradient in the denominator has the same ef- 
fects as annealing, reducing the learning rate over time. 

Since the magnitudes of gradients are factored out in 
ADAGRAD, this method can be sensitive to initial conditions 
of the parameters and the corresponding gradients. If the ini- 
tial gradients are large, the learning rates will be low for the 
remainder of training. This can be combatted by increasing 
the global learning rate, making the ADAGRAD method sen- 
sitive to the choice of learning rate. Also, due to the continual 
accumulation of squared gradients in the denominator, the 
learning rate will continue to decrease throughout training, 
eventually decreasing to zero and stopping training com- 
pletely. We created our ADADELTA method to overcome the 
sensitivity to the hyperparameter selection as well as to avoid 
the continual decay of the learning rates. 

2.3. Methods Using Second Order Information 

Whereas the above methods only utilized gradient and func- 
tion evaluations in order to optimize the objective, second 
order methods such as Newton's method or quasi-Newtons 
methods make use of the Hessian matrix or approximations 
to it. While this provides additional curvature information 
useful for optimization, computing accurate second order in- 
formation is often expensive. 

Since computing the entire Hessian matrix of second 
derivatives is too computationally expensive for large models, 
Becker and LecCun [5 ] proposed a diagonal approximation to 
the Hessian. This diagonal approximation can be computed 
with one additional forward and back-propagation through 
the model, effectively doubling the computation over SGD. 
Once the diagonal of the Hessian is computed, diag(H), the 
update rule becomes: 

Axt = — 7WW~, — 9t (6) 
|diag(il t )| +(i 

where the absolute value of this diagonal Hessian is used to 
ensure the negative gradient direction is always followed and 



p is a small constant to improve the conditioning of the Hes- 
sian for regions of small curvature. 

A recent method by Schaul et al. [6| incorporating the 
diagonal Hessian with ADAGRAD-like terms has been intro- 
duced to alleviate the need for hand specified learning rates. 
This method uses the following update rule: 



1 



E[g t - 



|diag(ff t )l E[gj_ w , t ] 



fjt 



(7) 



where E[g t - w -.t] is the expected value of the previous w gra- 
dients and E[g^_ w:t ] is the expected value of squared gradi- 
ents over the same window w. Schaul et al. also introduce a 
heuristic for this window size w (see |6| for more details). 

3. ADADELTA METHOD 

The idea presented in this paper was derived from ADA- 
GRAD [3 1 in order to improve upon the two main draw- 
backs of the method: 1) the continual decay of learning rates 
throughout training, and 2) the need for a manually selected 
global learning rate. After deriving our method we noticed 
several similarities to Schaul et al. |6|, which will be com- 
pared to below. 

In the ADAGRAD method the denominator accumulates 
the squared gradients from each iteration starting at the be- 
ginning of training. Since each term is positive, this accumu- 
lated sum continues to grow throughout training, effectively 
shrinking the learning rate on each dimension. After many it- 
erations, this learning rate will become infinitesimally small. 

3.1. Idea 1: Accumulate Over Window 

Instead of accumulating the sum of squared gradients over all 
time, we restricted the window of past gradients that are ac- 
cumulated to be some fixed size w (instead of size t where 
t is the current iteration as in ADAGRAD). With this win- 
dowed accumulation the denominator of ADAGRAD cannot 
accumulate to infinity and instead becomes a local estimate 
using recent gradients. This ensures that learning continues 
to make progress even after many iterations of updates have 
been done. 

Since storing w previous squared gradients is inefficient, 
our methods implements this accumulation as an exponen- 
tially decaying average of the squared gradients. Assume at 
time t this running average is E[g 2 ] t then we compute: 



E[g%=pE[g%_ l + {l-p) 



(8) 



where p is a decay constant similar to that used in the momen- 
tum method. Since we require the square root of this quantity 
in the parameter updates, this effectively becomes the RMS 
of previous squared gradients up to time t: 

RMS[. 9 ] t - VE[g 2 ]t + e (9) 
where a constant e is added to better condition the denomina- 
tor as in [5 1 . The resulting parameter update is then: 



Ax t 
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(10) 



Algorithm 1 Computing ADADELTA update at time t 
Require: Decay rate p, Constant e 
Require: Initial parameter X\ 
1: Initialize accumulation variables E[g 2 ]o = 0, E [Ax 2 ] = 
for t — 1 : T do %% Loop over # of updates 
Compute Gradient: g t 

Accumulate Gradient: E[g 2 ] t = pE[g 2 ] t -i + (1 - p)g 2 t 



Compute Update: Ax t = RMS[g]t gt 

Accumulate Updates: E[Ax 2 ] t = pE[Ax 2 \ 
Apply Update: xt+i = xt + Axt 
end for 



- 1 +(l-p)Ax 2 t 



3.2. Idea 2: Correct Units with Hessian Approximation 

When considering the parameter updates, Ax, being applied 
to x, the units should match. That is, if the parameter had 
some hypothetical units, the changes to the parameter should 
be changes in those units as well. When considering SGD, 
Momentum, or ADAGRAD, we can see that this is not the 
case. The units in SGD and Momentum relate to the gradient, 
not the parameter: 



units of Aa; oc units of g cx ^J- cx 

ox units of x 



(ID 



assuming the cost function, /, is unitless. ADAGRAD also 
does not have correct units since the update involves ratios of 
gradient quantities, hence the update is unitless. 

In contrast, second order methods such as Newton's 
method that use Hessian information or an approximation 
to the Hessian do have the correct units for the parameter 
updates: 

___ (12) 

clx' 2 



Ax cx H 1 g cx -j^j cx units of x 



Noticing this mismatch of units we considered terms to 



add to Eqn. 10 in order for the units of the update to match 
the units of the parameters. Since second order methods are 
correct, we rearrange Newton's method (assuming a diagonal 
Hessian) for the inverse of the second derivative to determine 
the quantities involved: 

1 Ax 
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dx 
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dl 

dx 



(13) 



Since the RMS of the previous gradients is already repre- 



sented in the denominator in Eqn. 10 we considered a mea- 
sure of the Aa; quantity in the numerator. Ax t for the current 
time step is not known, so we assume the curvature is locally 
smooth and approximate Axt by compute the exponentially 
decaying RMS over a window of size w of previous Ax to 
give the ADADELTA method: 



* RMS[Ax] t _! 

Ax t = T,wr.r_l 9t 



(14) 



RMS[ 5 ] t 

where the same constant e is added to the numerator RMS as 
well. This constant serves the purpose both to start off the first 



iteration where Axo = and to ensure progress continues to 
be made even if previous updates become small. 

This derivation made the assumption of diagonal curva- 
ture so that the second derivatives could easily be rearranged. 
Furthermore, this is an approximation to the diagonal Hessian 
using only RMS measures of g and Ax. This approximation 
is always positive as in Becker and LeCun 0, ensuring the 
update direction follows the negative gradient at each step. 



In Eqn. 14 the RMS[Ax] t _i quantity lags behind the de- 
nominator by 1 time step, due to the recurrence relationship 
for Ax t . An interesting side effect of this is that the system is 
robust to large sudden gradients which act to increase the de- 
nominator, reducing the effective learning rate at the current 
time step, before the numerator can react. 



The method in Eqn. 14 uses only first order information 
and has some properties from each of the discussed meth- 
ods. The negative gradient direction for the current iteration 
— g t is always followed as in SGD. The numerator acts as 
an acceleration term, accumulating previous gradients over a 
window of time as in momentum. The denominator is re- 
lated to ADAGRAD in that the squared gradient information 
per-dimension helps to even out the progress made in each di- 
mension, but is computed over a window to ensure progress 
is made later in training. Finally, the method relates to Schaul 
et al. 's in that some approximation to the Hessian is made, 
but instead costs only one gradient computation per iteration 
by leveraging information from past updates. For the com- 
plete algorithm details see Algorithm [T] 

4. EXPERIMENTS 

We evaluate our method on two tasks using several different 
neural network architectures. We train the neural networks 
using SGD, Momentum, ADAGRAD, and ADADELTA in a 
supervised fashion to minimize the cross entropy objective 
between the network output and ground truth labels. Compar- 
isons are done both on a local computer and in a distributed 
compute cluster. 

4.1. Handwritten Digit Classification 

In our first set of experiments we train a neural network on the 
MNIST handwritten digit classification task. For comparison 
with Schaul et al. 's method we trained with tanh nonlinear- 
ities and 500 hidden units in the first layer followed by 300 
hidden units in the second layer, with the final softmax out- 
put layer on top. Our method was trained on mini-batches of 
100 images per batch for 6 epochs through the training set. 
Setting the hyperparameters to e = le — 6 and p = 0.95 we 
achieve 2.00% test set error compared to the 2.10% of Schaul 
et al. While this is nowhere near convergence it gives a sense 
of how quickly the algorithms can optimize the classification 
objective. 
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Fig. 1. Comparison of learning rate methods on MNIST digit 
classification for 50 epochs. 

To further analyze various methods to convergence, we 
train the same neural network with 500 hidden units in the first 
layer, 300 hidden units in the second layer and rectified linear 
activation functions in both layers for 50 epochs. We notice 
that rectified linear units work better in practice than tanh, and 
their non-saturating nature further tests each of the methods 
at coping with large variations of activations and gradients. 

In Fig. [T] we compare SGD, Momentum, ADAGRAD, 
and ADADELTA in optimizing the test set errors. The unal- 
tered SGD method does the worst in this case, whereas adding 
the momentum term to it significantly improves performance. 
ADAGRAD performs well for the first 10 epochs of training, 
after which it slows down due to the accumulations in the de- 
nominator which continually increase. ADADELTA matches 
the fast initial convergence of ADAGRAD while continuing 
to reduce the test error, converging near the best performance 
which occurs with momentum. 
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Table 1. MNIST test error rates after 6 epochs of training for 
various hyperparameter settings using SGD, MOMENTUM, 
and ADAGRAD. 
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Table 2. MNIST test error rate after 6 epochs for various 
hyperparameter settings using ADADELTA. 



4.2. Sensitivity to Hyperparameters 

While momentum converged to a better final solution than 
ADADELTA after many epochs of training, it was very sen- 
sitive to the learning rate selection, as was SGD and ADA- 
GRAD. In Table[T]we vary the learning rates for each method 
and show the test set errors after 6 epochs of training using 
rectified linear units as the activation function. The optimal 
settings from each column were used to generate Fig.[T] With 
SGD, Momentum, or ADAGRAD the learning rate needs to 
be set to the correct order of magnitude, above which the so- 
lutions typically diverge and below which the optimization 
proceeds slowly. We can see that these results are highly vari- 
able for each method, compared to ADADELTA in Table [2] 
in which the two hyperparameters do not significantly alter 
performance. 

4.3. Effective Learning Rates 

To investigate some of the properties of ADADELTA we plot 
in Fig.[2]the step sizes and parameter updates of 10 randomly 
selected dimensions in each of the 3 weight matrices through- 
out training. There are several interesting things evident in 
this figure. First, the step sizes, or effective learning rates (all 



ent problem in neural networks and thus should have larger 
learning rates. 

Secondly, near the end of training these step sizes con- 
verge to 1. This is typically a high learning rate that would 
lead to divergence in most methods, however this conver- 
gence towards 1 only occurs near the end of training when the 
gradients and parameter updates are small. In this scenario, 
the e constants in the numerator and denominator dominate 
the past gradients and parameter updates, converging to the 
learning rate of 1 . 

This leads to the last interesting property of ADADELTA 
which is that when the step sizes become 1, the parameter 
updates (shown on the right of Fig. [2j tend towards zero. This 
occurs smoothly for each of the weight matrices effectively 
operating as if an annealing schedule was present. 

However, having no explicit annealing schedule imposed 
on the learning rate could be why momentum with the proper 
hyperparameters outperforms ADADELTA later in training as 
seen in Fig. [T] With momentum, oscillations that can occur 
near a minima are smoothed out, whereas with ADADELTA 
these can accumulate in the numerator. An annealing sched- 
ule could possibly be added to the ADADELTA method to 
counteract this in future work. 



terms except g t from Eqn. 14 1 shown in the left portion of the 4.4. Speech Data 



figure are larger for the lower layers of the network and much 
smaller for the top layer at the beginning of training. This 
property of ADADELTA helps to balance the fact that lower 
layers have smaller gradients due to the diminishing gradi- 
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Fig. 2. Step sizes and parameter updates shown every 60 
batches during training the MNIST network with tanh non- 
linearities for 25 epochs. Left: Step sizes for 10 randomly 
selected dimensions of each of the 3 weight matrices of the 
network. Right: Parameters changes for the same 10 dimen- 
sions for each of the 3 weight matrices. Note the large step 
sizes in lower layers that help compensate for vanishing gra- 
dients that occur with backpropagation. 



In the next set of experiments we trained a large-scale neu- 
ral network with 4 hidden layers on several hundred hours 
of US English data collected using Voice Search, Voice IME, 
and read data. The network was trained using the distributed 
system of (4j in which a centralized parameter server accu- 
mulates the gradient information reported back from several 
replicas of the neural network. In our experiments we used ei- 
ther 100 or 200 such replica networks to test the performance 
of ADADELTA in a highly distributed environment. 

The neural network is setup as in Q where the inputs 
are 26 frames of audio, each consisting of 40 log-energy fil- 
ter bank outputs. The outputs of the network were 8,000 
senone labels produced from a GMM-HMM system using 
forced alignment with the input frames. Each hidden layer 
of the neural network had 2560 hidden units and was trained 
with either logistic or rectified linear nonlinearities. 

Fig.[3]shows the performance of the ADADELTA method 
when using 100 network replicas. Notice our method ini- 
tially converges faster and outperforms ADAGRAD through- 
out training in terms of frame classification accuracy on the 
test set. The same settings of e = le~ 6 and p = 0.95 from 
the MNIST experiments were used for this setup. 

When training with rectified linear units and using 200 
model replicas we also used the same settings of hyperpa- 
rameters (see Fig. [4]). Despite having 200 replicates which 
inherently introduces significants amount of noise to the gra- 
dient accumulations, the ADADELTA method performs well, 
quickly converging to the same frame accuracy as the other 
methods. 
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Fig. 3. Comparison of ADAGRAD and ADADELTA on the 
Speech Dataset with 100 replicas using logistic nonlinearities. 
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Fig. 4. Comparison of ADAGRAD, Momentum, and 
ADADELTA on the Speech Dataset with 200 replicas using 
rectified linear nonlinearities. 



5. CONCLUSION 



In this tech report we introduced a new learning rate method 
based on only first order information which shows promis- 
ing result on MNIST and a large scale Speech recognition 
dataset. This method has trivial computational overhead com- 
pared to SGD while providing a per-dimension learning rate. 
Despite the wide variation of input data types, number of hid- 
den units, nonlinearities and number of distributed replicas, 
the hyperparameters did not need to be tuned, showing that 
ADADELTA is a robust learning rate method that can be ap- 
plied in a variety of situations. 
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